Abstract

We address the issue of using mini-batches in 
stochastic optimization of SVMs. We show 
that the same quantity, the spectral norm of 
the data, controls the parallelization speedup 
obtained for both primal stochastic subgradient
descent (SGD) and stochastic dual coordinate
ascent (SDCA) methods and use it to
derive novel variants of mini-batched SDCA. 
Our guarantees for both methods are ex- 
pressed in terms of the original nonsmooth 
primal problem based on the hinge-loss. 



1. Introduction 

Stochastic optimization approaches have been shown 
to have significant theoretical and empirical advan- 
tages in training linear Support Vector Machines 
(SVMs), as well as in many other learning applica- 
tions, and are often the methods of choice in prac- 
tice. Such methods use a single, randomly chosen, 
training example at each iteration. In the context of 
SVMs, approaches of this form include primal stochas- 
tic gradient descent (SGD) methods (e.g., Pegasos, 
Shalev-Shwartz et al. 2011; NORMA, Zhang 2004)
and dual stochastic coordinate ascent (Hsieh et al., 
2008). 

However, the inherent sequential nature of such ap- 
proaches becomes a problematic limitation for parallel 
and distributed computations as the predictor must 
be updated after each training point is processed, pro- 
viding very little opportunity for parallelization. A 
popular remedy is to use mini-batches, that is, to use
several training points at each iteration instead of just
one, calculating the update based on each point separately
and aggregating the updates. The question is
then whether basing each iteration on several points 
can indeed reduce the number of required iterations, 
and thus yield parallelization speedups. 

In this paper, we consider using mini-batches with 
Pegasos (SGD on the primal objective) and with 
Stochastic Dual Coordinate Ascent (SDCA). We show 
that for both methods, the quantity that controls the 
speedup obtained using mini-batching/parallelization 
is the spectral norm of the data. 

In Section 3 we provide the first analysis of mini- 
batched Pegasos (with the original, non-smooth, 
SVM objective) that provably leads to parallelization 
speedups (Theorem 1). The idea of using mini-batches 
with Pegasos is not new, and is discussed already by 
Shalev-Shwartz et al. (2011), albeit without a theo- 
retical justification. The original Pegasos theoretical 
analysis does not benefit from using mini-batches — the 
same number of iterations is required even when large 
mini-batches are used, there is no speedup, and the 
serial runtime (overall number of operations, in this 
case data accesses) increases linearly with the mini- 
batch size. In fact, no parallelization speedup can be 
guaranteed based only on a bound on the radius of the 
data, as in the original Pegasos analysis. Instead, we 
provide a refined analysis based on the spectral norm 
of the data. 

We then move on to SDCA (Section 4). We show 
the situation is more involved, and a modification to 
the method is necessary. SDCA has been consistently 
shown to outperform Pegasos in practice (Hsieh et al., 
2008; Shalev-Shwartz et al., 2011), and is also popular
as it does not rely on setting a step-size as in Pegasos. 
It is thus interesting and useful to obtain mini-batch
variants of SDCA as well. We first show that a naive 
mini-batching approach for SDCA can fail, in partic- 
ular when the mini-batch size is large relative to the 
spectral norm (Section 4.1). We then present a "safe" 
variant of mini-batched SDCA, which depends on the 
spectral norm, and an analysis for this safe variant that 
establishes the same spectral-norm-dependent paral- 
lelization speedups as for Pegasos (Section 4.2). Simi- 
lar to a recent analysis of non-mini-batched SDCA by 
Shalev-Shwartz & Zhang (2012), we establish a guar- 
antee on the duality gap, and thus also on the sub- 
optimality of the primal SVM objective, when us- 
ing mini-batched SDCA (Theorem 2). We then go on 
to describe a more aggressive, adaptive, method for 
mini-batched SDCA, which is based on the analysis of 
the "safe" approach, and which we show often outper- 
forms it in practice (Section 4.3, with experiments in 
Section 5). 

For simplicity of presentation we focus on the hinge 
loss, as in the SVM objective. However, all our results 
for both Pegasos and SDCA are valid for any Lipschitz 
continuous loss function. 



Related Work. Several recent papers consider the 
use of mini-batches in stochastic gradient descent, 
as well as stochastic dual averaging and stochas- 
tic mirror descent, when minimizing a smooth loss 
function (Dekel et al., 2012; Agarwal & Duchi, 2011;
Cotter et al., 2011). These papers establish paral- 
lelization speedups for smooth loss minimization with 
mini-batches, possibly with the aid of some "acceler- 
ation" techniques, and without relying on, or consid- 
ering, the spectral norm of the data. However, these 
results do not apply to SVM training, where the ob- 
jective to be minimized is the non-smooth hinge loss. 
In fact, the only data assumption in these papers is 
an assumption on the radius of the data, which is not 
enough for obtaining parallelization guarantees when
the loss is non-smooth. Our contribution is thus or- 
thogonal to these papers, showing that it is possible 
to obtain parallelization speedups even for non-smooth 
objectives, but only with a dependence on the spectral 
norm. We also analyze SDCA, which is a substantially 
different method from the methods analyzed in these 
papers. It is interesting to note that a bound on the
spectral norm could perhaps indicate that it is easier
to "smooth" the objective, and thus allow obtaining
results similar to ours (i.e., on the suboptimality of
the original non-smooth objective) by smoothing the
objective and relying on mini-batched smooth SGD,
where the spectral norm might control how well the
smoothed loss captures the original loss. But we are
not aware of any analysis of this nature, nor do we
know whether such an analysis is possible.

There has been some recent work on mini-batched coordinate
descent methods for $\ell_1$-regularized problems
(and, more generally, regularization by a separable convex
function), similar to the SVM dual. Bradley et al.
(2011) presented and analyzed SHOTGUN, a parallel
coordinate descent method for $\ell_1$-regularized problems,
showing linear speedups for mini-batch sizes
bounded in terms of the spectral norm of the data.
The analysis does not directly apply to the SVM dual
because of the box constraints, but is similar in spirit.
Furthermore, Bradley et al. (2011) do not discuss a 
"safe" variant which is applicable for any mini-batch 
size, and only study the analogue of what we re- 
fer to as "naive" mini-batching (Section 4.1). More 
directly related is recent work of Richtarik & Takac 
(2013; 2012) which provided a theoretical framework 
and analysis for a more general setting than SHOT- 
GUN, that includes also the SVM dual as a special 
case. However, guarantees in this framework, as well 
as those of Bradley et al. (2011), are only on the dual 
suboptimality (in our terminology), and not on the 
more relevant primal suboptimality, i.e., the subop- 
timality of the original SVM problem we are inter- 
ested in. Our theoretical analysis builds on that of 
Richtarik & Takac (2012), combined with recent ideas 
of Shalev-Shwartz & Zhang (2012) for "standard" (se- 
rial) SDCA, to obtain bounds on the duality gap and 
primal suboptimality. 

2. Support Vector Machines 

We consider the optimization problem of training a
linear¹ Support Vector Machine (SVM) based on $n$ labeled
training examples $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^d$
and $y_i \in \{\pm 1\}$. We use $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$
to denote the matrix of training examples. We assume the
data is normalized such that $\max_i \|x_i\| \le 1$, and thus
suppress the dependence on $\max_i \|x_i\|$ in all results.
Training an SVM corresponds to finding a linear predictor
$w \in \mathbb{R}^d$ with low $\ell_2$-norm $\|w\|$ and small (empirical)
average hinge loss $\hat{L}(w) := \frac{1}{n} \sum_{i=1}^n \ell(y_i \langle w, x_i \rangle)$,
where $\ell(z) := [1 - z]_+ = \max\{0, 1 - z\}$. This bi-objective
problem can be serialized as



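$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^n \ell(y_i \langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2, \tag{1}$$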



1 Since both Pegasos and SDCA can be kernelized, all 
methods discussed are implementable also with kernels, 
and all our results hold. However, the main advantage 
of SGD and SDCA is where the feature map is given ex- 
plicitly, and so we focus our presentation on this setting. 
where $\lambda > 0$ is a regularization trade-off parameter. It
is also useful to consider the dual of (1):



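$$\max_{\alpha \in \mathbb{R}^n,\; 0 \le \alpha_i \le 1} \; D(\alpha) := -\frac{1}{2 \lambda n^2} \alpha^\top Q \alpha + \frac{1}{n} \sum_{i=1}^n \alpha_i, \tag{2}$$

where

$$Q \in \mathbb{R}^{n \times n}, \qquad Q_{i,j} = y_i y_j \langle x_i, x_j \rangle, \tag{3}$$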



is the Gram matrix of the (labeled) data. The (primal)
optimum of (1) is given by $w^* = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^* y_i x_i$,
where $\alpha^*$ is the (dual) optimum of (2). It is thus
natural to associate with each dual solution $\alpha$ a primal
solution (i.e., a linear predictor)

$$w(\alpha) := \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i y_i x_i. \tag{4}$$
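The definitions (1)-(4) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the examples are stored as rows of `X` (the transpose of the $d \times n$ matrix used in the text).

```python
import numpy as np

def hinge(z):
    # l(z) = max{0, 1 - z}
    return np.maximum(0.0, 1.0 - z)

def primal_objective(w, X, y, lam):
    # P(w) from eq. (1): average hinge loss plus (lam/2) ||w||^2
    return hinge(y * (X @ w)).mean() + 0.5 * lam * (w @ w)

def gram_matrix(X, y):
    # Q_ij = y_i y_j <x_i, x_j> from eq. (3)
    Z = X * y[:, None]
    return Z @ Z.T

def dual_objective(alpha, X, y, lam):
    # D(alpha) from eq. (2)
    n = len(y)
    Q = gram_matrix(X, y)
    return -(alpha @ Q @ alpha) / (2.0 * lam * n**2) + alpha.sum() / n

def primal_from_dual(alpha, X, y, lam):
    # w(alpha) from eq. (4)
    n = len(y)
    return (X.T @ (alpha * y)) / (lam * n)
```

For any $w$ and any feasible $\alpha$ (entries in $[0, 1]$), weak duality gives `primal_objective(w, ...) >= dual_objective(alpha, ...)`, so the difference of the two functions evaluated at $w(\alpha)$ and $\alpha$ is the duality gap bounded later in Theorem 2.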



We will be discussing "mini-batches" of size $b$, represented
by random subsets $A \subseteq [n] := \{1, 2, \ldots, n\}$ of
examples, drawn uniformly at random from all subsets
of $[n]$ of cardinality $b$. Whenever we draw such a
subset, we will for simplicity write $A \in \mathrm{Rand}(b)$. For
$A \in \mathrm{Rand}(b)$ we use $Q_A \in \mathbb{R}^{b \times b}$ to denote the random
submatrix of $Q$ corresponding to rows and columns
indexed by $A$, $v_A \in \mathbb{R}^b$ to denote a similar restriction
of a vector $v \in \mathbb{R}^n$, and $v_{[A]} \in \mathbb{R}^n$ for the "censored"
vector where entries inside $A$ are as in $v$ and entries
outside $A$ are zero. The average hinge loss on examples
in $A$ is denoted by

$$\hat{L}_A(w) := \frac{1}{b} \sum_{i \in A} \ell(y_i \langle w, x_i \rangle). \tag{5}$$

3. Mini-Batches in Primal Stochastic 
Gradient Descent Methods 

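Algorithm 1 Pegasos with Mini-Batches

Input: $\{(x_i, y_i)\}_{i=1}^n$, $\lambda > 0$, $b \in [n]$, $T \ge 1$
Initialize: set $w^{(1)} = 0 \in \mathbb{R}^d$
for $t = 1$ to $T$ do
    Choose random mini-batch $A_t \in \mathrm{Rand}(b)$
    $\eta_t = \frac{1}{\lambda t}$, $A_t^+ = \{i \in A_t : y_i \langle w^{(t)}, x_i \rangle < 1\}$
    $w^{(t+1)} = (1 - \eta_t \lambda) w^{(t)} + \frac{\eta_t}{b} \sum_{i \in A_t^+} y_i x_i$
end for

Output: $\bar{w}^{(T)} = \frac{2}{T} \sum_{t = \lfloor T/2 \rfloor + 1}^{T} w^{(t)}$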
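Algorithm 1 can be written in a few lines of NumPy. The sketch below is an illustrative transcription, assuming examples are stored as rows of `X`; the function name, the `seed` argument, and the choice to divide by the exact number of averaged iterates (which equals $T/2$, matching the $2/T$ factor, when $T$ is even) are assumptions of this example.

```python
import numpy as np

def minibatch_pegasos(X, y, lam, b, T, seed=0):
    """Mini-batched Pegasos with tail averaging (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                      # w^(1) = 0
    w_sum = np.zeros(d)
    first_tail = T // 2 + 1              # average w^(t) for t = floor(T/2)+1, ..., T
    for t in range(1, T + 1):
        if t >= first_tail:
            w_sum += w                   # accumulate w^(t) before it is updated
        A = rng.choice(n, size=b, replace=False)   # A_t in Rand(b)
        eta = 1.0 / (lam * t)
        margins = y[A] * (X[A] @ w)
        viol = A[margins < 1.0]          # A_t^+: examples violating the margin
        w = (1.0 - eta * lam) * w + (eta / b) * (X[viol].T @ y[viol])
    return w_sum / (T - first_tail + 1)
```

The per-example margins and the sum over the violating examples are the only data-dependent operations, and both can be computed in parallel across the mini-batch.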



Pegasos is an SGD approach to solving (1), where at
each iteration the iterate $w^{(t)}$ is updated based on an
unbiased estimator of a sub-gradient of the objective
$P(w)$. Whereas in a "pure" stochastic setting, the sub-gradient
is estimated based on only a single training
example, in our mini-batched variation (Algorithm 1)
at each iteration we consider the partial objective:

$$P_t(w) := \hat{L}_{A_t}(w) + \frac{\lambda}{2} \|w\|^2, \tag{6}$$

where $A_t \in \mathrm{Rand}(b)$. We then calculate the subgradient
of the partial objective $P_t$ at $w^{(t)}$:

$$\nabla^{(t)} := \nabla P_t(w^{(t)}) \overset{(6)}{=} \nabla \hat{L}_{A_t}(w^{(t)}) + \lambda w^{(t)}, \tag{7}$$

where

$$\nabla \hat{L}_A(w) = -\frac{1}{b} \sum_{i \in A} \chi_i(w) y_i x_i \tag{8}$$

and $\chi_i(w) := 1$ if $y_i \langle w, x_i \rangle < 1$ and $0$ otherwise (indicator
for not classifying example $i$ correctly with
a margin). The next iterate is obtained by setting
$w^{(t+1)} = w^{(t)} - \eta_t \nabla^{(t)}$. We can now write

$$w^{(t+1)} \overset{(7)+(8)}{=} (1 - \eta_t \lambda) w^{(t)} + \frac{\eta_t}{b} \sum_{i \in A_t} \chi_i(w^{(t)}) y_i x_i. \tag{9}$$

Analysis of mini-batched Pegasos rests on bounding
the norm of the subgradient estimates $\nabla^{(t)}$. An unconditional
bound on this norm, used in the standard
Pegasos analysis, follows from bounding

$$\|\nabla \hat{L}_A(w)\| \le \frac{1}{b} \sum_{i \in A} \|\chi_i(w) y_i x_i\| \le \frac{1}{b} \sum_{i \in A} 1 = 1.$$

From (7) we then get $\|\nabla^{(t)}\| \le \lambda \|w^{(t)}\| + 1$; the standard
Pegasos analysis follows. This bound relies only
on the assumption $\max_i \|x_i\| \le 1$, and is the tightest
bound without further assumptions on the data.

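The core novel observation here is that the expected
(square) norm of $\nabla \hat{L}_A$ can be bounded in terms of (an
upper bound on) the spectral norm of the data:

$$\sigma^2 \ge \frac{1}{n} \|X\|^2 = \frac{1}{n} \Big\| \sum_i x_i x_i^\top \Big\| = \frac{1}{n} \|Q\|, \tag{10}$$

where $\|\cdot\|$ denotes the spectral norm (largest singular
value) of a matrix. In order to bound $\nabla \hat{L}_A$, we first
perform the following calculation, introducing the key
quantity $\beta_b$, useful also in the analysis of SDCA.

Lemma 1. For any $v \in \mathbb{R}^n$, $Q \in \mathbb{R}^{n \times n}$, $A \in \mathrm{Rand}(b)$,

$$\mathbb{E}[v_{[A]}^\top Q v_{[A]}] = \frac{b}{n} \Big[ \Big(1 - \frac{b-1}{n-1}\Big) \sum_{i=1}^n Q_{ii} v_i^2 + \frac{b-1}{n-1} v^\top Q v \Big].$$

Moreover, if $Q_{ii} \le 1$ for all $i$ and $\frac{1}{n} \|Q\| \le \sigma^2$, then
$\mathbb{E}[v_{[A]}^\top Q v_{[A]}] \le \frac{b}{n} \beta_b \|v\|^2$, where

$$\beta_b := 1 + \frac{(b-1)(n \sigma^2 - 1)}{n-1}. \tag{11}$$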
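Proof.

$$\mathbb{E}[v_{[A]}^\top Q v_{[A]}] = \mathbb{E}\Big[ \sum_{i \in A} v_i^2 Q_{ii} + \sum_{i, j \in A,\, i \ne j} v_i v_j Q_{ij} \Big]$$
$$\overset{(*)}{=} b\, \mathbb{E}_i[v_i^2 Q_{ii}] + b(b-1)\, \mathbb{E}_{i,j}[v_i v_j Q_{ij}]$$
$$= \frac{b}{n} \sum_i Q_{ii} v_i^2 + \frac{b(b-1)}{n(n-1)} v^\top (Q - \mathrm{diag}(Q)) v,$$

where in (*) the expectations are over $i, j$ chosen uniformly
at random without replacement. Now using
$Q_{ii} \le 1$ and $\|Q\| \le n \sigma^2$, we can upper-bound the
expectation as follows:

$$\le \frac{b}{n} \Big[ \Big(1 - \frac{b-1}{n-1}\Big) \|v\|^2 + \frac{b-1}{n-1} n \sigma^2 \|v\|^2 \Big] = \frac{b}{n} \beta_b \|v\|^2. \qquad \square$$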
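The identity in Lemma 1 is easy to check numerically on a small problem by enumerating all subsets of size $b$. The sketch below does so; the function names are hypothetical, and exhaustive enumeration is only practical for small $n$.

```python
import itertools
import numpy as np

def beta_b(b, n, sigma2):
    # beta_b from eq. (11)
    return 1.0 + (b - 1) * (n * sigma2 - 1.0) / (n - 1)

def expected_censored_quadratic(v, Q, b):
    # Exact E[v_[A]^T Q v_[A]] over all subsets A of size b (uniform).
    n = len(v)
    values = []
    for A in itertools.combinations(range(n), b):
        idx = list(A)
        values.append(v[idx] @ Q[np.ix_(idx, idx)] @ v[idx])
    return float(np.mean(values))

def lemma1_closed_form(v, Q, b):
    # Right-hand side of the identity in Lemma 1.
    n = len(v)
    r = (b - 1) / (n - 1)
    return (b / n) * ((1.0 - r) * np.sum(np.diag(Q) * v**2) + r * (v @ Q @ v))
```

With unit-norm examples and $\sigma^2 = \|Q\| / n$, the enumerated expectation should also stay below $\frac{b}{n} \beta_b \|v\|^2$, which is the second claim of the lemma.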

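We can now apply Lemma 1 to $\nabla \hat{L}_A$:

Lemma 2. For any $w \in \mathbb{R}^d$ and $A \in \mathrm{Rand}(b)$ we have
$\mathbb{E}[\|\nabla \hat{L}_A(w)\|^2] \le \frac{\beta_b}{b}$, where $\beta_b$ is as in Lemma 1.

Proof. If $\chi \in \mathbb{R}^n$ is the vector with entries $\chi_i(w)$, then

$$\mathbb{E}[\|\nabla \hat{L}_A(w)\|^2] \overset{(8)}{=} \mathbb{E}\Big[ \Big\| \frac{1}{b} \sum_{i \in A} \chi_i y_i x_i \Big\|^2 \Big] \overset{(3)}{=} \frac{1}{b^2} \mathbb{E}[\chi_{[A]}^\top Q \chi_{[A]}] \overset{(\text{Lem 1})}{\le} \frac{\beta_b}{n b} \|\chi\|^2 \le \frac{\beta_b}{b},$$

where the last step uses $\|\chi\|^2 \le n$. $\square$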



Using the by-now standard analysis of SGD for
strongly convex functions, we obtain the main result
of this section:

Theorem 1. After $T$ iterations of Pegasos with mini-batches
(Algorithm 1), we have that for the averaged
iterate $\bar{w}^{(T)} = \frac{2}{T} \sum_{t = \lfloor T/2 \rfloor + 1}^{T} w^{(t)}$:

$$\mathbb{E}[P(\bar{w}^{(T)})] - \inf_{w \in \mathbb{R}^d} P(w) \le \frac{\beta_b}{b} \cdot \frac{30}{\lambda T}.$$



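Proof. Unrolling (9) with $\eta_t = 1/(\lambda t)$ yields

$$w^{(t)} = -\frac{1}{\lambda (t-1)} \sum_{\tau=1}^{t-1} g^{(\tau)}, \tag{12}$$

where $g^{(\tau)} := \nabla \hat{L}_{A_\tau}(w^{(\tau)})$. Using the inequality
$\| \sum_{\tau=1}^{t-1} g^{(\tau)} \|^2 \le (t-1) \sum_{\tau=1}^{t-1} \| g^{(\tau)} \|^2$, we now get

$$\mathbb{E}[\|w^{(t)}\|^2] \overset{(12)}{\le} \frac{1}{\lambda^2 (t-1)} \sum_{\tau=1}^{t-1} \mathbb{E}[\|g^{(\tau)}\|^2] \overset{(\text{Lem 2})}{\le} \frac{\beta_b}{\lambda^2 b}, \tag{13}$$

$$\mathbb{E}[\|\nabla^{(t)}\|^2] \overset{(7)+(\text{Lem 2})}{\le} 2 \Big( \lambda^2 \mathbb{E}[\|w^{(t)}\|^2] + \frac{\beta_b}{b} \Big) \overset{(13)}{\le} \frac{4 \beta_b}{b}.$$

The performance guarantee is now given by the
analysis of SGD with tail averaging (Theorem 5 of
Rakhlin et al. 2012, with $\alpha = \frac{1}{2}$ and $G^2 = \frac{4 \beta_b}{b}$). $\square$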



Parallelization speedup. When $b = 1$ we have
$\beta_b = 1$ (see (11)) and Theorem 1 agrees with the standard
(serial) Pegasos analysis² (Shalev-Shwartz et al.,
2011). For larger mini-batches, the guarantee depends
on the quantity $\beta_b$, which in turn depends on the spectral
norm $\sigma^2$. Since $\frac{1}{n} \le \sigma^2 \le 1$, we have $1 \le \beta_b \le b$.

The worst-case situation is at a degenerate extreme,
when all data points lie on a single line, and so $\sigma^2 = 1$
and $\beta_b = b$. In this case Lemma 2 degenerates to
the worst-case bound of $\mathbb{E}[\|\nabla \hat{L}_A(w)\|^2] \le 1$, and in
Theorem 1 we have $\frac{\beta_b}{b} = 1$, indicating that using larger
mini-batches does not help at all, and the same number
of iterations (i.e., the same parallel runtime, and $b$ times
as much serial runtime) is required.

However, when $\sigma^2 < 1$, and so $\beta_b < b$, we see a benefit
in using mini-batches in Theorem 1, corresponding to
a parallelization speedup of $\frac{b}{\beta_b}$. The best situation is
when $\sigma^2 = \frac{1}{n}$, and so $\beta_b = 1$, which happens when
all training points are orthogonal. In this case there
is never any interaction between points in the mini-batch,
and using a mini-batch of size $b$ is just as effective
as making $b$ single-example steps. When $\beta_b = 1$
we indeed see that the speedup is equal to
the mini-batch size $b$, and that the behavior in
terms of the number of data accesses (equivalently, serial
runtime) $bT$ does not depend on $b$; that is, even
with larger mini-batches, we require no more data accesses,
and we gain linearly from being able to perform
the accesses in parallel. The case $\sigma^2 = \frac{1}{n}$ is rather extreme,
but even for intermediate values $\frac{1}{n} < \sigma^2 < 1$ we
get speedup. In particular, as long as $b \le \frac{1}{\sigma^2}$ we have
$\beta_b \le 2$, and an essentially linear speedup. Roughly
speaking, $\frac{1}{\sigma^2}$ captures the number of examples in the
mini-batch beyond which we start getting significant
interactions between points.
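To make the two extremes concrete, the following sketch computes the smallest $\sigma^2$ allowed by (10) directly from a data matrix and the speedup factor $b / \beta_b$ predicted by Theorem 1. The helper names are hypothetical, and examples are assumed to be stored as rows of `X`; for large data sets one would estimate the spectral norm rather than compute it exactly.

```python
import numpy as np

def sigma2_from_data(X):
    # Smallest sigma^2 satisfying (10): ||X||^2 / n (spectral norm squared over n).
    n = X.shape[0]
    return np.linalg.norm(X, 2) ** 2 / n

def beta_b(b, n, sigma2):
    # beta_b from eq. (11)
    return 1.0 + (b - 1) * (n * sigma2 - 1.0) / (n - 1)

def predicted_speedup(b, n, sigma2):
    # Reduction in the iteration bound of Theorem 1 relative to b = 1.
    return b / beta_b(b, n, sigma2)

# Orthogonal examples: sigma^2 = 1/n, so the predicted speedup equals b.
X_orth = np.eye(8)
# Collinear examples: sigma^2 = 1, so mini-batching gives no predicted speedup.
X_line = np.tile(np.array([1.0, 0.0]), (8, 1))
```

Evaluating `predicted_speedup(4, 8, sigma2_from_data(X_orth))` and the same expression for `X_line` reproduces the best and worst cases discussed above.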

4. Mini-Batches in Dual Stochastic 
Coordinate Ascent Methods 

An alternative stochastic method to Pegasos
is Stochastic Dual Coordinate Ascent (SDCA,
Hsieh et al. 2008), aimed at solving the dual problem
(2). At each iteration we choose a single training
example $(x_i, y_i)$, uniformly at random, corresponding
to a single dual variable (coordinate) $\alpha_i = e_i^\top \alpha$.
Subsequently, $\alpha_i$ is updated so as to maximize the
(dual) objective, keeping all other coordinates of $\alpha$
unchanged and maintaining the box constraints. At



2 Except that we avoid the logarithmic factor by relying 
on tail averaging and a more modern SGD analysis. 
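iteration $t$, the update $\delta_i^{(t)}$ to $\alpha_i^{(t)}$ is computed via

$$\begin{aligned}
\delta_i^{(t)} &:= \arg\max_{0 \le \alpha_i^{(t)} + \delta \le 1} D(\alpha^{(t)} + \delta e_i) \\
&= \arg\max_{0 \le \alpha_i^{(t)} + \delta \le 1} \big(\lambda n - (Q e_i)^\top \alpha^{(t)}\big) \delta - \frac{Q_{ii}}{2} \delta^2 \\
&= \mathrm{clip}_{[-\alpha_i^{(t)},\, 1 - \alpha_i^{(t)}]} \Big[ \frac{\lambda n - (Q e_i)^\top \alpha^{(t)}}{Q_{ii}} \Big] \\
&= \mathrm{clip}_{[-\alpha_i^{(t)},\, 1 - \alpha_i^{(t)}]} \Big[ \frac{\lambda n \big(1 - y_i \langle w(\alpha^{(t)}), x_i \rangle\big)}{\|x_i\|^2} \Big],
\end{aligned} \tag{14}$$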

where $\mathrm{clip}_I$ is projection onto the interval $I$. Variables
$\alpha_j^{(t)}$ for $j \ne i$ are unchanged. Hence, a single
iteration has the form $\alpha^{(t+1)} = \alpha^{(t)} + \delta_i^{(t)} e_i$. Similar
to a Pegasos update, at each iteration a single,
random, training point is considered, the "response"
$y_i \langle w(\alpha^{(t)}), x_i \rangle$ is calculated (this operation dominates
the computational effort), and based on the response,
a multiple of $x_i$ is added to the weight vector $w$ (corresponding
to changing $\alpha_i$). The two methods thus
involve fairly similar operations at each iteration, with
essentially identical computational costs. They differ
in that in Pegasos, $\alpha_i$ is changed according to some
pre-determined step-size, while SDCA changes it optimally
so as to maximize the dual objective (and maintain
dual feasibility); there is no step-size parameter.
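A single serial SDCA step following (14) can be sketched as below. This is an illustrative implementation under the assumptions that examples are rows of `X`, that $x_i \ne 0$, and that $w = w(\alpha)$ is maintained alongside $\alpha$; the function names are hypothetical.

```python
import numpy as np

def sdca_step(alpha, w, X, y, lam, i):
    """One exact coordinate-ascent step on dual variable i, eq. (14)."""
    n = len(y)
    response = y[i] * (X[i] @ w)                        # y_i <w(alpha), x_i>
    delta = lam * n * (1.0 - response) / (X[i] @ X[i])  # unconstrained maximizer
    delta = np.clip(delta, -alpha[i], 1.0 - alpha[i])   # keep 0 <= alpha_i <= 1
    alpha = alpha.copy()
    alpha[i] += delta
    w = w + (delta / (lam * n)) * y[i] * X[i]           # keep w = w(alpha)
    return alpha, w

def dual_value(alpha, w, lam):
    # D(alpha) written via w(alpha): -(lam/2) ||w||^2 + (1/n) sum_i alpha_i
    return -0.5 * lam * (w @ w) + alpha.mean()
```

Because each step maximizes the dual exactly along one coordinate, `dual_value` never decreases from one call of `sdca_step` to the next.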

SDCA was suggested and studied empirically by
Hsieh et al. (2008), where empirical advantages over
Pegasos were often observed. In terms of a theoretical
analysis, by considering the dual problem (2)
as an $\ell_1$-regularized, box-constrained quadratic problem,
it is possible to obtain guarantees on the dual
suboptimality, $D(\alpha^*) - D(\alpha^{(t)})$, after a finite number
of SDCA iterations (Shalev-Shwartz & Tewari,
2011; Nesterov, 2012; Richtarik & Takac, 2013). However,
such guarantees do not directly imply guarantees
on the primal suboptimality of $w(\alpha^{(t)})$. Recently,
Shalev-Shwartz & Zhang (2012) bridged this gap, and
provided guarantees on $P(w(\alpha^{(t)})) - P(w^*)$ after a
finite number of SDCA iterations. These guarantees
serve as the starting point for our theoretical study.

4.1. Naive Mini-Batching 

A naive approach to parallelizing SDCA using mini-batches
is to compute $\delta_i^{(t)}$ in parallel, according to
(14), for all $i \in A_t$, all based on the current iterate
$\alpha^{(t)}$, and then update $\alpha_i^{(t+1)} = \alpha_i^{(t)} + \delta_i^{(t)}$ for $i \in A_t$,
and keep $\alpha_j^{(t+1)} = \alpha_j^{(t)}$ for $j \notin A_t$. However, not only
might this approach not reduce the number of required
iterations, it might actually increase the number of
required iterations. This is because the dual objective
need not improve monotonically (as it does for "pure"
SDCA), and may even fail to converge.



To see this, consider an extreme situation with only
two identical training examples: $Q = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$, $\lambda = \frac{1}{n} = \frac{1}{2}$,
and mini-batch size $b = 2$ (i.e., in each iteration
we use both examples). If we start with $\alpha^{(0)} = 0$,
with $D(\alpha^{(0)}) = 0$, then $\delta_1^{(0)} = \delta_2^{(0)} = 1$ and following
the naive approach we have $\alpha^{(1)} = (1, 1)^\top$ with
objective value $D(\alpha^{(1)}) = 0$. In the next iteration
$\delta_1^{(1)} = \delta_2^{(1)} = -1$, which brings us back to $\alpha^{(2)} = 0$.
So the algorithm will alternate between those two solutions
with objective value $D(\alpha) = 0$, while at the
optimum $D(\alpha^*) = D((0.5, 0.5)^\top) = 0.25$.
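The oscillation in this toy example can be reproduced directly. The sketch below applies the naive update (14) to all coordinates at once (here $b = n = 2$), working with the Gram matrix $Q$; the function names are hypothetical.

```python
import numpy as np

def naive_minibatch_step(alpha, Q, lam):
    # Apply the single-coordinate update (14) to every coordinate in parallel,
    # each computed from the same current iterate alpha.
    n = len(alpha)
    delta = (lam * n - Q @ alpha) / np.diag(Q)
    delta = np.clip(delta, -alpha, 1.0 - alpha)
    return alpha + delta

def dual(alpha, Q, lam):
    # D(alpha) from eq. (2)
    n = len(alpha)
    return -(alpha @ Q @ alpha) / (2.0 * lam * n**2) + alpha.sum() / n

Q = np.ones((2, 2))       # two identical examples
lam = 0.5                 # lambda = 1/n
alpha = np.zeros(2)
trajectory = [alpha]
for _ in range(4):
    alpha = naive_minibatch_step(alpha, Q, lam)
    trajectory.append(alpha)
# trajectory alternates between (0, 0) and (1, 1); the dual value stays at 0,
# whereas dual(np.array([0.5, 0.5]), Q, lam) equals 0.25.
```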

This is of course a simplistic toy example, but the same 
phenomenon will occur when a large number of train- 
ing examples are identical or highly correlated. This 
can also be observed empirically in some of our exper- 
iments discussed later, e.g., in Figure 2. 

The problem here is that since we update each $\alpha_i$ independently
to its optimal value as if all other coordinates
were fixed, we are ignoring interactions between
the updates. As we see in the extreme example above,
two different $i, j \in A_t$ might suggest essentially the
same change to $w(\alpha^{(t)})$, but we would then perform
this update twice, overshooting and yielding a new iterate
which is actually worse than the previous one.

4.2. Safe Mini-Batching 

Properly accounting for the interactions between coordinates
in the mini-batch would require jointly optimizing
over all $\alpha_i$, $i \in A_t$. This would be a very
powerful update and no doubt reduce the number of
required iterations, but would require solving a box-constrained
quadratic program, with a quadratic term
of the form $\delta_A^\top Q_A \delta_A$, $\delta_A \in \mathbb{R}^b$, at each iteration. This
quadratic program cannot be distributed to different
machines, each handling only a single data point.

Instead, we propose a "safe" variant, where the term
$\delta_A^\top Q_A \delta_A$ is approximately bounded by the separable
surrogate $\beta \|\delta_A\|^2$, for some $\beta > 0$ which we will discuss
later. That is, the update is given by:

$$\begin{aligned}
\delta_i^{(t)} &:= \arg\max_{0 \le \alpha_i^{(t)} + \delta \le 1} \big(\lambda n - (Q e_i)^\top \alpha^{(t)}\big) \delta - \frac{\beta}{2} \delta^2 \\
&= \mathrm{clip}_{[-\alpha_i^{(t)},\, 1 - \alpha_i^{(t)}]} \Big[ \frac{\lambda n \big(1 - y_i \langle w(\alpha^{(t)}), x_i \rangle\big)}{\beta} \Big],
\end{aligned} \tag{15}$$

with $\alpha_i^{(t+1)} = \alpha_i^{(t)} + \delta_i^{(t)}$ for $i \in A_t$, and $\alpha_j^{(t+1)} = \alpha_j^{(t)}$
for $j \notin A_t$. In essence, $\frac{1}{\beta}$ serves as a step-size, where
we are now careful not to take steps so big that they
will accumulate together and overshoot the objective.
If handling only a single point at each iteration, such
a short-step approach is not necessary, we do not need
a step-size, and we can take a "full step", setting $\alpha_i$
optimally ($\beta = 1$). But with the potential for interaction
between coordinates updated in parallel, we must
use a smaller step, depending on the potential for such
interactions.
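One safe mini-batched step following (15) can be sketched as follows, with every coordinate in the mini-batch updated from the same iterate and damped by $1/\beta$. The function name and argument layout are assumptions of this example; examples are rows of `X`, and `A` is an index array without repeated entries.

```python
import numpy as np

def safe_minibatch_step(alpha, w, X, y, lam, A, beta):
    """One mini-batched SDCA step with step-size 1/beta, eq. (15)."""
    n = len(y)
    responses = y[A] * (X[A] @ w)                        # computed in parallel
    delta = lam * n * (1.0 - responses) / beta
    delta = np.clip(delta, -alpha[A], 1.0 - alpha[A])    # box constraints
    alpha = alpha.copy()
    alpha[A] += delta
    w = w + (X[A].T @ (delta * y[A])) / (lam * n)        # keep w = w(alpha)
    return alpha, w
```

On the two-identical-examples problem of Section 4.1, where $\sigma^2 = 1$ and hence $\beta_b = 2$ for $b = 2$, a single call with `beta=2` moves $\alpha$ from $(0, 0)$ to the optimum $(0.5, 0.5)$ instead of overshooting to $(1, 1)$.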

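We will first rely on the bound (10), and establish that
the choice $\beta = \beta_b$ as in (11) provides for a safe step
size. To do so, we consider the dual objective at $\alpha + \delta$,

$$D(\alpha + \delta) = -\frac{\alpha^\top Q \alpha + 2 \alpha^\top Q \delta + \delta^\top Q \delta}{2 \lambda n^2} + \sum_{i=1}^n \frac{\alpha_i + \delta_i}{n}, \tag{16}$$

and the following separable approximation to it:

$$H(\delta, \alpha) := -\frac{\alpha^\top Q \alpha + 2 \alpha^\top Q \delta + \beta_b \|\delta\|^2}{2 \lambda n^2} + \sum_{i=1}^n \frac{\alpha_i + \delta_i}{n}, \tag{17}$$

in which $\beta_b \|\delta\|^2$ replaces $\delta^\top Q \delta$. Our update (15) with
$\beta = \beta_b$ can be written as $\delta = \arg\max_{\delta :\, 0 \le \alpha + \delta \le 1} H(\delta, \alpha)$
(we then use the coordinates $\delta_i$ for $i \in A$ and ignore
the rest). We are essentially performing parallel coordinate
ascent on the separable approximation $H(\delta, \alpha)$
instead of on $D(\alpha + \delta)$. To understand this approximation,
we note that $H(0, \alpha) = D(\alpha)$, and show
that $H(\delta, \alpha)$ provides an approximate expected lower
bound on $D(\alpha + \delta)$: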

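Lemma 3. For any $\alpha, \delta \in \mathbb{R}^n$ and $A \in \mathrm{Rand}(b)$,
$\mathbb{E}_A[D(\alpha + \delta_{[A]})] \ge (1 - \frac{b}{n}) D(\alpha) + \frac{b}{n} H(\delta, \alpha)$.

Proof. Examining (16) and (17), the terms that do not
depend on $\delta$ are equal on both sides. For the linear
term in $\delta$, we have that $\mathbb{E}[\delta_{[A]}] = \frac{b}{n} \delta$, and again we
have equality on both sides. For the quadratic term we
use Lemma 1, which yields $\mathbb{E}[\delta_{[A]}^\top Q \delta_{[A]}] \le \frac{b}{n} \beta_b \|\delta\|^2$,
and after negation establishes the desired bound. $\square$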
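Lemma 3 can also be verified numerically on a small instance by averaging $D(\alpha + \delta_{[A]})$ over all subsets of size $b$. The sketch below is an illustration with hypothetical function names, working directly with the Gram matrix $Q$; it is feasible only for small $n$.

```python
import itertools
import numpy as np

def dual(alpha, Q, lam):
    # D(alpha) from eq. (2)
    n = len(alpha)
    return -(alpha @ Q @ alpha) / (2.0 * lam * n**2) + alpha.sum() / n

def separable_H(delta, alpha, Q, lam, beta):
    # H(delta, alpha) from eq. (17), with beta in place of beta_b
    n = len(alpha)
    quad = alpha @ Q @ alpha + 2.0 * (alpha @ Q @ delta) + beta * (delta @ delta)
    return -quad / (2.0 * lam * n**2) + (alpha + delta).sum() / n

def expected_dual_after_step(alpha, delta, Q, lam, b):
    # Exact E_A[D(alpha + delta_[A])] over all subsets A of size b.
    n = len(alpha)
    values = []
    for A in itertools.combinations(range(n), b):
        censored = np.zeros(n)
        censored[list(A)] = delta[list(A)]
        values.append(dual(alpha + censored, Q, lam))
    return float(np.mean(values))
```

With $\beta = \beta_b$ computed from $\sigma^2 = \|Q\| / n$, the value returned by `expected_dual_after_step` should be at least `(1 - b/n) * dual(alpha, Q, lam) + (b/n) * separable_H(delta, alpha, Q, lam, beta)`.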

Inequalities of this general type are also studied in 
(Richtarik & Takac, 2012) (see Sections 3 and 4). 
Based on the above lemma, we can modify the anal- 
ysis of Shalev-Shwartz & Zhang (2012) to obtain (see 
complete proof in the appendix): 

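Theorem 2. Consider the SDCA updates given by
(15), with $A_t \in \mathrm{Rand}(b)$, starting from $\alpha^{(0)} = 0$ and
with $\beta = \beta_b$ (given in eq. (11)). For any $\epsilon > 0$ and

$$t_0 \ge \max\Big\{0, \Big\lceil \frac{n}{b} \log\Big(\frac{2 \lambda n}{\beta_b}\Big) \Big\rceil \Big\}, \tag{18}$$

$$T_0 \ge t_0 + \frac{\beta_b}{b} \Big[ \frac{4}{\lambda \epsilon} - \frac{2n}{\beta_b} \Big]_+, \tag{19}$$

$$T \ge T_0 + \max\Big\{ \Big\lceil \frac{n}{b} \Big\rceil, \frac{\beta_b}{b} \frac{1}{\lambda \epsilon} \Big\}, \tag{20}$$

$$\bar{\alpha} := \frac{1}{T - T_0} \sum_{t = T_0}^{T-1} \alpha^{(t)}, \tag{21}$$

we have

$$\mathbb{E}[P(w(\bar{\alpha}))] - P(w^*) \le \mathbb{E}[P(w(\bar{\alpha})) - D(\bar{\alpha})] \le \epsilon.$$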

The number of iterations of mini-batched SDCA, sufficient
to reach primal suboptimality $\epsilon$, is by Theorem 2
equal to

$$\tilde{O}\Big( \frac{n}{b} + \frac{\beta_b}{b} \cdot \frac{1}{\lambda \epsilon} \Big). \tag{22}$$

We observe the same speedup as in the case of mini-batched
Pegasos: a factor of $\frac{b}{\beta_b}$, with an essentially linear
speedup when $b \le \frac{1}{\sigma^2}$. It is interesting to note that
the quantity $\beta_b$ only affects the second, $\epsilon$-dependent,
term in (22). The "fixed cost" term, which essentially
requires a full pass over the data, is not affected by $\beta_b$,
and is always scaled down by $b$.

4.3. Aggressive Mini-Batching 

Using $\beta = \beta_b$ is safe, but might be too
safe/conservative. In particular, we used the spectral
norm to bound $\delta^\top Q \delta \le \|Q\| \|\delta\|^2$ in Lemma 3
(through Lemma 1), but this is a worst-case bound
over all possible vectors, and might be loose for the relevant
vectors $\delta$. Relying on a worst-case bound might
mean we are taking much smaller steps than we could
be. Furthermore, the approach we presented thus far
relies on knowing the spectral norm of the data, or at
least a bound on the spectral norm (recall (10)), in
order to set the step-size. Although it is possible to
estimate this quantity by sampling, this can certainly
be inconvenient.

Instead, we suggest a more aggressive variant of mini-batched
SDCA which gradually adapts $\beta$ based on the
actual values of $\|\delta_{[A_t]}^{(t)}\|^2$ and $\delta_{[A_t]}^{(t)\top} Q\, \delta_{[A_t]}^{(t)}$. In Section 5
one can observe advantages of this aggressive strategy.

In this variant, at each iteration we calculate the ratio
$\delta_{[A]}^\top Q \delta_{[A]} / \|\delta_{[A]}\|^2$, and nudge the step size towards
it by updating it to a weighted geometric average of
the previous step size and the "optimal" step size based on
the step $\delta$ considered. One complication is that due to
the box constraints, not only the magnitude but also
the direction of the step $\delta$ depends on the step-size $\beta$,
leading to a circular situation. The approach we take
is as follows: we maintain a "current step size" $\beta$. At
each iteration, we first calculate a tentative step $\tilde{\delta}_A$,
according to (15), with the current $\beta$. We then calculate
$\rho = \tilde{\delta}_{[A]}^\top Q \tilde{\delta}_{[A]} / \|\tilde{\delta}_{[A]}\|^2$, according to this step direction, and
update $\beta$ to $\beta^{\gamma} \rho^{1 - \gamma}$ for some pre-determined parameter
$0 < \gamma < 1$ that controls how quickly the step-size
adapts. But, instead of using $\tilde{\delta}$ calculated with the
previous $\beta$, we actually re-compute $\delta_A$ using the step-size
$\rho$. We note that this means the ratio $\rho$ does not
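Algorithm 2 SDCA with Mini-Batches (aggressive)

Input: $\{(x_i, y_i)\}_{i=1}^n$, $\lambda > 0$, $b \in [n]$, $T \ge 1$, $\gamma = 0.95$
Initialize: set $\alpha^{(0)} = 0$, $w^{(0)} = 0$, $\beta^{(0)} = \beta_b$
for $t = 0$ to $T$ do
    Choose $A_t \in \mathrm{Rand}(b)$
    For $i \in A_t$, compute $\tilde{\delta}_i$ from (15) using $\beta = \beta^{(t)}$
    Sum $\zeta := \sum_{i \in A_t} \tilde{\delta}_i^2$ and $\Delta := \sum_{i \in A_t} \tilde{\delta}_i y_i x_i$
    Compute $\rho = \mathrm{clip}_{[1, \beta_b]} \big( \|\Delta\|^2 / \zeta \big)$
    For $i \in A_t$, compute $\delta_i$ from (15) using $\beta = \rho$
    $\beta^{(t+1)} := (\beta^{(t)})^{\gamma} \rho^{1 - \gamma}$
    if $D(\alpha^{(t)} + \delta_{[A_t]}) > D(\alpha^{(t)})$ then
        $\alpha^{(t+1)} = \alpha^{(t)} + \delta_{[A_t]}$,
        $w^{(t+1)} = w^{(t)} + \frac{1}{\lambda n} \sum_{i \in A_t} \delta_i y_i x_i$
    else
        $\alpha^{(t+1)} = \alpha^{(t)}$, $w^{(t+1)} = w^{(t)}$
    end if
end for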



correspond to the step $\delta_A$ actually taken, but rather
to the tentative step $\tilde{\delta}_A$. We could potentially continue
iteratively updating $\rho$ according to $\delta_A$ and $\delta_A$
according to $\rho$, but we found that this does not improve
performance significantly and is generally not
worth the extra computational effort. This aggressive
strategy is summarized in Algorithm 2. Note that we
initialize $\beta = \beta_b$, and also constrain $\beta$ to remain in
the range $[1, \beta_b]$, but we can use a very crude upper
bound $\sigma^2$ for calculating $\beta_b$. Also, in our aggressive
strategy, we refuse steps that do not actually increase
the dual objective, corresponding to overly aggressive
step sizes.

Carrying out the aggressive strategy requires computing
$\delta_{[A]}^\top Q \delta_{[A]}$ and the dual objective efficiently and in
parallel. The main observation here is that:

$$\delta_{[A]}^\top Q \delta_{[A]} = \Big\| \sum_{i \in A} \delta_i y_i x_i \Big\|^2, \tag{23}$$

and so the main operation to be performed is an aggregation
of $\sum_{i \in A} \delta_i y_i x_i$, similar to the operation required
in mini-batched Pegasos. As for the dual objective,
it can be written as $D(\alpha) = -\frac{\lambda}{2} \|w(\alpha)\|^2 + \frac{1}{n} \|\alpha\|_1$
and can thus be readily calculated if we maintain
$w(\alpha)$, its norm, and $\|\alpha\|_1$.
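The aggressive strategy can be sketched in NumPy using the identity (23) and the expression for $D(\alpha)$ in terms of $w(\alpha)$ and $\|\alpha\|_1$. This is an illustrative reading of Algorithm 2, not the authors' implementation: the function name, the `seed` argument, and the guard that keeps $\rho$ at the current $\beta$ when the tentative step is exactly zero are assumptions of this example. Examples are stored as rows of `X`.

```python
import numpy as np

def aggressive_sdca(X, y, lam, b, T, beta_b, gamma=0.95, seed=0):
    """Mini-batched SDCA with an adaptive step-size (sketch of Algorithm 2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha, w, beta = np.zeros(n), np.zeros(d), float(beta_b)
    l1 = 0.0  # running value of ||alpha||_1 (alpha stays non-negative)

    def step(A, beta_used):
        # Update (15) for all i in A, from the current alpha and w.
        delta = lam * n * (1.0 - y[A] * (X[A] @ w)) / beta_used
        return np.clip(delta, -alpha[A], 1.0 - alpha[A])

    def dual(w_vec, l1_norm):
        # D(alpha) = -(lam/2) ||w(alpha)||^2 + ||alpha||_1 / n
        return -0.5 * lam * (w_vec @ w_vec) + l1_norm / n

    for _ in range(T):
        A = rng.choice(n, size=b, replace=False)
        tentative = step(A, beta)
        zeta = tentative @ tentative
        Delta = X[A].T @ (tentative * y[A])          # aggregation used in (23)
        if zeta > 0.0:
            rho = float(np.clip((Delta @ Delta) / zeta, 1.0, beta_b))
        else:
            rho = beta                               # assumption: no information
        delta = step(A, rho)                         # re-computed step
        beta = beta**gamma * rho**(1.0 - gamma)      # geometric averaging
        w_new = w + (X[A].T @ (delta * y[A])) / (lam * n)
        l1_new = l1 + delta.sum()
        if dual(w_new, l1_new) > dual(w, l1):        # refuse non-improving steps
            alpha[A] += delta
            w, l1 = w_new, l1_new
    return alpha, w, beta
```

Only the aggregated vector $\sum_{i \in A} \delta_i y_i x_i$ and a few scalars need to be shared across workers in each iteration, which mirrors the communication pattern of mini-batched Pegasos.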

5. Experiments 

Figure 1 shows the number of iterations (corresponding
to the parallel runtime) required for achieving
a primal suboptimality of 0.001 using Pegasos,


Table 1. Datasets and regularization parameters $\lambda$ used;
"%" is the percentage of features which are non-zero. cov is the
forest covertype dataset of Shalev-Shwartz et al. (2011),
astro-ph consists of abstracts of physics papers, also
of Shalev-Shwartz et al. (2011), rcv1 is from the Reuters
collection and news20 is from the 20 newsgroups collection, both
obtained from the libsvm collection (Libsvm).

Data        # train    # test     # dim        %       $\lambda$
cov         522,911    58,101     54           22      0.000010
rcv1        20,242     677,399    47,236       0.16    0.000100
astro-ph    29,882     32,487     99,757       0.08    0.000050
news20      15,020     4,976      1,355,191    0.04    0.000125



naive SDCA, safe SDCA and aggressive SDCA, on
four benchmark datasets detailed in Table 1, using
different mini-batch sizes. Also shown (on an independent
scale; right axis) is the leading term $\frac{\beta_b}{b}$ in our
complexity results. The results confirm the advantage
of SDCA over Pegasos, at least for $b = 1$, and that
both Pegasos and SDCA enjoy nearly-linear speedups,
at least for small batch sizes. Once the mini-batch
size is such that $\frac{\beta_b}{b}$ starts flattening out (corresponding
to $b \approx \frac{1}{\sigma^2}$, and so significant correlations inside
each mini-batch), the safe variant of SDCA follows a
similar behavior and does not allow for much parallelization
speedup beyond this point, but at least does
not deteriorate like the naive variant. Pegasos and the
aggressive variant do continue showing speedups beyond
$b \approx \frac{1}{\sigma^2}$. The experiments clearly demonstrate
that the aggressive modification allows SDCA to continue
enjoying roughly the same empirical speedups as Pegasos,
even for large mini-batch sizes, maintaining an advantage
throughout. It is interesting to note that the
aggressive variant continues improving even past the
point of failure of the naive variant, thus establishing
that it is empirically important to adjust the step-size
to achieve a balance between safety and progress.

In Figure 2 we demonstrate the evolution of solutions 
using the various methods for two specific data sets. 
Here we can again see the relative behaviour of the 
methods, as well as clearly see the failure of the naive 
approach, which past some point causes the objective 
to deteriorate and does not converge to the optimal 
solution. 



6. Conclusion 

Contribution. Our contribution in this paper is 
twofold: (i) we identify the spectral norm of the 
data, and through it the quantity $\beta_b$, as the important
quantity controlling guarantees for mini-batched/parallelized
Pegasos (primal method) and
Figure 1. Number of iterations (left vertical axis) needed to find a 0.001-accurate primal solution for different mini-batch
sizes $b$ (horizontal axis, "Batch size"; one panel per dataset). The leading factor in our analysis, $\frac{\beta_b}{b}$, is plotted on the right vertical axis.

Figure 2. Evolution of primal (solid) and dual (dashed) sub-optimality and test error for the news20 ($b = 256$) and astro-ph ($b = 8192$) datasets,
as a function of the number of iterations (horizontal axis, logarithmic scale).
Instead of tail averaging, in the experiments we used decaying averaging with $\bar{w}^{(t)} = 0.9\, \bar{w}^{(t-1)} + 0.1\, w^{(t)}$.



SDCA (dual method). We provide the first analysis 
of mini-batched Pegasos, with the non-smooth hinge-loss,
that shows speedups, and we analyze for the first
time mini-batched SDCA with guarantees expressed in 
terms of the primal problem (hence, our mini-batched 
SDCA is a primal-dual method); (ii) based on our anal- 
ysis, we present novel variants of mini-batched SDCA 
which are necessary for achieving speedups similar to 
those of Pegasos, and thus open the door to effec- 
tive mini-batching using the often-empirically-better 
SDCA. 

Related work. Our safe SDCA mini-batching 
approach is similar to the parallel coordinate 
descent methods of Bradley et al. (2011) and 
Richtárik & Takáč (2012), but we provide an analysis 
in terms of the primal SVM objective, which is 
the more relevant object of interest. Furthermore, 
Bradley et al.'s analysis does not use a step-size and 
is thus limited only to small enough mini-batches: 
if the spectral norm is unknown and too large a 
mini-batch is used, their method might not converge. 
Richtárik & Takáč's method does incorporate a fixed 
step-size, similar to our safe variant, but as we discuss 
this step-size might be too conservative for achieving 
the true potential of mini-batching. 

Generality. We chose to focus on Pegasos and SDCA 
with regularized hinge-loss minimization, but all our 
results remain unchanged for any Lipschitz loss 
function. Furthermore, Lemma 2 can also be used to 
establish identical speedups for mini-batched SGD 
optimization of $\min_{\|w\| \le B} L(w)$, as well as for direct 
stochastic approximation of the population objective 
(generalization error) $\min_w L(w)$. In considering the 
population objective, the sample size is essentially 
infinite, we sample with replacement (from the 
population), $\sigma^2$ is a bound on the second moment of the data 
distribution, and $\beta_b = 1 + (b - 1)\sigma^2$. 
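
As a rough numerical illustration of the population formula above, the following sketch evaluates $\beta_b = 1 + (b-1)\sigma^2$ and the ratio $\beta_b / b$ for a few mini-batch sizes. The function names and the sample values of $\sigma^2$ are illustrative assumptions and are not taken from the experiments.

```python
def beta_b(b: int, sigma2: float) -> float:
    """Population form quoted in the text: beta_b = 1 + (b - 1) * sigma^2."""
    if b < 1:
        raise ValueError("mini-batch size b must be at least 1")
    return 1.0 + (b - 1) * sigma2


def leading_factor(b: int, sigma2: float) -> float:
    """Ratio beta_b / b; smaller values mean fewer iterations are needed."""
    return beta_b(b, sigma2) / b


if __name__ == "__main__":
    # sigma2 values below are hypothetical, chosen only to show the trend.
    for sigma2 in (0.01, 0.1, 1.0):
        factors = [round(leading_factor(b, sigma2), 4) for b in (1, 16, 256)]
        print(f"sigma^2={sigma2}: beta_b/b for b=1,16,256 -> {factors}")
```

With $\sigma^2 = 1$ the ratio stays at 1 for every $b$ (no speedup), while for small $\sigma^2$ it decays roughly like $1/b$ until $b$ approaches $1/\sigma^2$.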

Experiments. Our experiments confirm the 
empirical advantages of SDCA over Pegasos, previously 
observed without mini-batching. However, we also point 
out that in order to perform mini-batched SDCA 
effectively, a step-size is needed, detracting from one of 
the main advantages of SDCA over Pegasos. 
Furthermore, in the safe variant, this stepsize needs to be set 
according to the spectral norm (or a bound on the 
spectral norm), with too small a setting for $\beta$ (i.e., too 
large steps) possibly leading to non-convergence, and 
too large a setting for $\beta$ yielding reduced speedups. 
In contrast, the Pegasos stepsize is independent of the 
spectral norm, and in a sense Pegasos adapts 
implicitly (see, e.g., its behavior compared to aggressive 
SDCA in the experiments). We do provide a more 
aggressive variant of SDCA, which does match 
Pegasos's speedups empirically, but this requires an explicit 
heuristic adaptation of the stepsize. 
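
To make concrete the remark that the Pegasos stepsize does not depend on the spectral norm, here is a minimal sketch of a mini-batched Pegasos loop. It follows the standard mini-batch Pegasos update with stepsize $1/(\lambda t)$ and is not necessarily the exact variant analyzed in this paper; the function name, the sampling without replacement, and the `decay` parameter of the running average are assumptions made for illustration.

```python
import numpy as np


def minibatch_pegasos(X, y, lam, b, n_iters, decay=0.9, seed=0):
    """Sketch of mini-batched Pegasos for the regularized hinge loss.

    X: (n, d) data matrix, y: (n,) labels in {-1, +1}.
    Returns the last iterate and a decaying average of the iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, n_iters + 1):
        batch = rng.choice(n, size=b, replace=False)  # assumed sampling scheme
        margins = y[batch] * (X[batch] @ w)
        violated = batch[margins < 1.0]               # points with nonzero hinge subgradient
        eta = 1.0 / (lam * t)                         # stepsize: no spectral norm involved
        w = (1.0 - eta * lam) * w
        if violated.size > 0:
            w = w + (eta / b) * (y[violated] @ X[violated])
        w_avg = decay * w_avg + (1.0 - decay) * w     # decaying average (assumed weights)
    return w, w_avg
```

Note that only $\lambda$, $t$ and $b$ enter the update; nothing has to be tuned to the data's spectral norm, in contrast to the safe SDCA variant.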

Parallel Implementation. In this paper we 
analyzed the iteration complexity, and behavior of the 
iterates, of mini-batched Pegasos and SDCA. 
Unlike "pure" ($b = 1$) Pegasos and SDCA, which are 
not amenable to parallelization, using mini-batches 
does provide opportunities for it. Of course, 
actually achieving good parallelization speedups on a 
specific architecture in practice requires an efficient 
parallel, possibly distributed, implementation of the 
iterations. In this regard, we point out that the core 
computation required for both Pegasos and SDCA is 
that of computing $\sum_{i \in A} g(\langle w, x_i \rangle)\, x_i$ over the 
mini-batch $A$, where $g$ is some scalar function. 
Parallelizing such computations efficiently 
in a distributed environment has been studied 
by, e.g., Dekel et al. (2012) and Hsu et al. (2011); their 
methods can be used here too. Alternatively, one 
could also consider asynchronous or delayed updates 
(Agarwal & Duchi, 2011; Niu et al., 2011). 
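
The core computation named above is a sum of per-example terms, so it splits naturally across workers. The sketch below shows this with NumPy, assuming a vectorized scalar function `g` and simulating the workers with a plain loop; the function names and the sharding scheme are illustrative assumptions, not the methods of the cited works.

```python
import numpy as np


def weighted_sum(X_batch, w, g):
    """Compute sum_i g(<w, x_i>) x_i over the rows x_i of X_batch."""
    return X_batch.T @ g(X_batch @ w)


def weighted_sum_sharded(X_batch, w, g, n_workers):
    """Same quantity, computed as a sum of per-shard partial results.

    Each partial sum depends only on its own shard and on w, so in a real
    system each could be computed on a separate worker and then reduced.
    """
    shards = np.array_split(X_batch, n_workers, axis=0)
    partials = [weighted_sum(shard, w, g) for shard in shards]
    return np.sum(partials, axis=0)
```

For example, with labels absorbed into the examples, `g = lambda z: (z < 1.0).astype(float)` gives the sum of examples that violate the hinge margin.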
A. Proof of Theorem 2 



The proof of Theorem 2 follows mostly along the path of Shalev-Shwartz & Zhang (2012), crucially using 
Lemma 3, and with a few other required modifications detailed below. 

We will prove the theorem for a general $L$-Lipschitz loss function $\ell(\cdot)$. For consistency with 
Shalev-Shwartz & Zhang, we will also allow example-specific loss functions $\ell_i$, $i = 1, 2, \ldots, n$, and only require 
each $\ell_i$ be individually Lipschitz, and thus refer to the primal and dual problems (expressed slightly differently 
but equivalently): 



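$$\min_{w \in \mathbb{R}^d} \; P(w) := \frac{1}{n} \sum_{i=1}^{n} \ell_i(\langle w, x_i \rangle) + \frac{\lambda}{2} \|w\|^2 \tag{P}$$

$$\max_{\alpha \in \mathbb{R}^n} \; D(\alpha) := -\frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} X^\top \alpha \right\|^2 \tag{D}$$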



where $\ell_i^*(u) = \max_z (zu - \ell_i(z))$ is the Fenchel conjugate of $\ell_i$. In the above we dropped without loss of 
generality the labels $y_i$ since we can always substitute $x_i \leftarrow y_i x_i$. For the hinge loss $\ell_i(a) = [1 - a]_+$ we have 
$\ell_i^*(-\alpha) = -\alpha$ for $\alpha \in [0, 1]$ and $\ell_i^*(-\alpha) = \infty$ otherwise, thus encoding the box constraints. Recall also (from (4)) 
that $w(\alpha) = \frac{1}{\lambda n} \sum_{i=1}^{n} \alpha_i x_i$ and so $\|w(\alpha)\|^2 = \alpha^\top \frac{1}{(\lambda n)^2} X X^\top \alpha = \left\| \frac{1}{\lambda n} X^\top \alpha \right\|^2$. 
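
As a sanity check on the primal and dual objectives above for the hinge loss, the following sketch evaluates both and confirms weak duality, $P(w(\alpha)) \ge D(\alpha)$, on random data. It assumes that the rows of `X` already have the labels absorbed ($x_i \leftarrow y_i x_i$); the function names are illustrative.

```python
import numpy as np


def w_of_alpha(X, alpha, lam):
    """w(alpha) = (1 / (lam * n)) * sum_i alpha_i x_i, with x_i the rows of X."""
    n = X.shape[0]
    return X.T @ alpha / (lam * n)


def primal(X, w, lam):
    """P(w) for the hinge loss l_i(a) = max(0, 1 - a), labels absorbed into X."""
    return np.mean(np.maximum(0.0, 1.0 - X @ w)) + 0.5 * lam * (w @ w)


def dual(X, alpha, lam):
    """D(alpha) for the hinge loss, using l_i*(-a) = -a on [0, 1] and +inf otherwise."""
    if np.any(alpha < 0.0) or np.any(alpha > 1.0):
        return -np.inf  # outside the box constraints the dual is -infinity
    w = w_of_alpha(X, alpha, lam)
    return np.mean(alpha) - 0.5 * lam * (w @ w)
```

Any feasible $\alpha \in [0,1]^n$ therefore yields both a primal candidate $w(\alpha)$ and a certificate of its sub-optimality via the gap $P(w(\alpha)) - D(\alpha)$.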

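The separable approximation $H(\delta, \alpha)$ defined in (17) now has the more general form: 

$$H(\delta, \alpha) := -\frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-(\alpha_i + \delta_i)) - \frac{\lambda}{2} \left( \|w(\alpha)\|^2 + \beta_b \sum_{i=1}^{n} \|x_i\|^2 \left( \frac{\delta_i}{\lambda n} \right)^2 + 2 \left( \frac{1}{\lambda n} \delta \right)^\top X w(\alpha) \right) \tag{24}$$

and all the properties mentioned in Section 4, including Lemma 3, still hold. 
Our goal here is to get a bound on the duality gap, which we will denote by 

$$G(\alpha) := P(w(\alpha)) - D(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \left[ \ell_i(\langle w(\alpha), x_i \rangle) + \ell_i^*(-\alpha_i) + \alpha_i \langle w(\alpha), x_i \rangle \right]. \tag{25}$$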
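
The second equality in (25) follows from the definitions of $P$, $D$ and $w(\alpha)$; a short derivation, written out here for convenience:

```latex
\begin{align*}
P(w(\alpha)) - D(\alpha)
  &= \frac{1}{n}\sum_{i=1}^{n}\Big[\ell_i(\langle w(\alpha), x_i\rangle) + \ell_i^*(-\alpha_i)\Big]
     + \frac{\lambda}{2}\|w(\alpha)\|^2 + \frac{\lambda}{2}\Big\|\tfrac{1}{\lambda n}X^\top\alpha\Big\|^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n}\Big[\ell_i(\langle w(\alpha), x_i\rangle) + \ell_i^*(-\alpha_i)\Big]
     + \lambda\|w(\alpha)\|^2, \\
\lambda\|w(\alpha)\|^2
  &= \lambda\Big\langle w(\alpha), \tfrac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i x_i\Big\rangle
   = \frac{1}{n}\sum_{i=1}^{n}\alpha_i\langle w(\alpha), x_i\rangle .
\end{align*}
```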



The analysis now rests on the following lemma, paralleling Lemma 1 of Shalev-Shwartz & Zhang (2012), which 
bounds the expected improvement in the dual objective after a single iteration in terms of the duality gap: 

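Lemma 4. For any $t$ and any $s \in [0, 1]$ we have 

$$\mathbb{E}_{A_t}\left[D(\alpha^{(t+1)})\right] - D(\alpha^{(t)}) \ge b \left( \frac{s}{n} G(\alpha^{(t)}) - \left( \frac{s}{n} \right)^2 \frac{\beta_b}{2\lambda} G^{(t)} \right), \tag{26}$$

where 

$$G^{(t)} := \frac{1}{n} \sum_{i=1}^{n} \|x_i\|^2 \left( \chi_i^{(t)} - \alpha_i^{(t)} \right)^2 \le G, \tag{27}$$

$G = 4L^2$ for general $L$-Lipschitz loss, and $G = 1$ for the hinge loss, and $-\chi_i^{(t)} \in \partial \ell_i(\langle w(\alpha^{(t)}), x_i \rangle)$. 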
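
For the hinge loss, the objects in Lemma 4 can be written down explicitly. The sketch below picks one valid $\chi_i$ with $-\chi_i \in \partial \ell_i(z_i)$, checks the conjugacy identity $\ell_i^*(-\chi_i) = -\chi_i z_i - \ell_i(z_i)$ that the proof relies on, and evaluates $G^{(t)}$ from (27). The function names are illustrative, and the bound $G^{(t)} \le 1$ is checked under the assumption $\|x_i\| \le 1$.

```python
import numpy as np


def hinge(z):
    """Hinge loss l(z) = max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - np.asarray(z, dtype=float))


def hinge_conj_neg(a):
    """l*(-a) for the hinge loss: -a on [0, 1], +inf otherwise."""
    a = np.asarray(a, dtype=float)
    return np.where((a >= 0.0) & (a <= 1.0), -a, np.inf)


def chi_hinge(z):
    """One valid choice of chi with -chi in the subdifferential of the hinge at z."""
    return (np.asarray(z, dtype=float) < 1.0).astype(float)


def G_t(X, chi, alpha):
    """G^(t) = (1/n) * sum_i ||x_i||^2 (chi_i - alpha_i)^2, as in (27)."""
    return np.mean(np.sum(X ** 2, axis=1) * (chi - alpha) ** 2)
```

Since both $\chi_i$ and $\alpha_i$ lie in $[0, 1]$ for the hinge loss, each squared difference is at most 1, which is where $G = 1$ comes from.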



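Proof. The situation here is trickier than in the case $b = 1$ considered by Shalev-Shwartz & Zhang. We will 
first use Lemma 3 to bound the change in the dual objective (the left-hand side of (26), negated and rescaled by $n/b$) 
by $-H + D$, and then use the fact that $\delta^{(t)}$ is a maximizer of $H(\cdot, \alpha^{(t)})$ (equivalently, a minimizer of $-H(\cdot, \alpha^{(t)})$): 

$$\begin{aligned}
-\frac{n}{b} \left( \mathbb{E}_{A_t}\left[D(\alpha^{(t+1)})\right] - D(\alpha^{(t)}) \right)
&= -\frac{n}{b} \left( \mathbb{E}_{A_t}\left[D(\alpha^{(t)} + \delta^{(t)}_{[A_t]})\right] - D(\alpha^{(t)}) \right)
\overset{\text{(Lemma 3)}}{\le} -H(\delta^{(t)}, \alpha^{(t)}) + D(\alpha^{(t)}) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \ell_i^*(-(\alpha_i^{(t)} + \delta_i^{(t)})) - \ell_i^*(-\alpha_i^{(t)}) \right)
+ \frac{\lambda}{2} \left( \beta_b \left\| \frac{1}{\lambda n} \delta^{(t)} \right\|_x^2 + 2 \left( \frac{1}{\lambda n} \delta^{(t)} \right)^\top X w(\alpha^{(t)}) \right),
\end{aligned}$$

where we denote $\|u\|_x^2 := \sum_{i=1}^{n} u_i^2 \|x_i\|^2$. We will now use the optimality of $\delta^{(t)}$ to upper bound the above, noting 
that if we replace $\delta^{(t)}$ with any other quantity, and in particular with $s(\chi^{(t)} - \alpha^{(t)})$, we can only decrease $H(\cdot, \alpha^{(t)})$, 
and thus increase the right-hand side above: 

$$\le \frac{1}{n} \sum_{i=1}^{n} \left[ \ell_i^*\left(-(\alpha_i^{(t)} + s(\chi_i^{(t)} - \alpha_i^{(t)}))\right) - \ell_i^*(-\alpha_i^{(t)}) \right]
+ \frac{\lambda}{2} \left( \beta_b \left\| \frac{s}{\lambda n} (\chi^{(t)} - \alpha^{(t)}) \right\|_x^2 + 2 \left( \frac{s}{\lambda n} (\chi^{(t)} - \alpha^{(t)}) \right)^\top X w(\alpha^{(t)}) \right).$$

Now from convexity we have $\ell_i^*\left(-(\alpha_i^{(t)} + s(\chi_i^{(t)} - \alpha_i^{(t)}))\right) \le s\, \ell_i^*(-\chi_i^{(t)}) + (1 - s)\, \ell_i^*(-\alpha_i^{(t)})$, and so: 

$$\le \frac{s}{n} \sum_{i=1}^{n} \left( \ell_i^*(-\chi_i^{(t)}) + \chi_i^{(t)} \langle w(\alpha^{(t)}), x_i \rangle - \ell_i^*(-\alpha_i^{(t)}) \right)
+ \frac{\lambda}{2} \left( \beta_b \left\| \frac{s}{\lambda n} (\chi^{(t)} - \alpha^{(t)}) \right\|_x^2 + 2 \left( \frac{s}{\lambda n} (-\alpha^{(t)}) \right)^\top X w(\alpha^{(t)}) \right),$$

and from conjugacy we have $\ell_i^*(-\chi_i^{(t)}) = -\chi_i^{(t)} \langle w(\alpha^{(t)}), x_i \rangle - \ell_i(\langle w(\alpha^{(t)}), x_i \rangle)$, and so: 

$$\begin{aligned}
&\le \frac{s}{n} \sum_{i=1}^{n} \left( -\chi_i^{(t)} \langle w(\alpha^{(t)}), x_i \rangle - \ell_i(\langle w(\alpha^{(t)}), x_i \rangle) + \chi_i^{(t)} \langle w(\alpha^{(t)}), x_i \rangle - \ell_i^*(-\alpha_i^{(t)}) \right) \\
&\qquad + \frac{\lambda}{2} \left( \beta_b \left\| \frac{s}{\lambda n} (\chi^{(t)} - \alpha^{(t)}) \right\|_x^2 + 2 \left( \frac{s}{\lambda n} (-\alpha^{(t)}) \right)^\top X w(\alpha^{(t)}) \right) \\
&\le \frac{s}{n} \sum_{i=1}^{n} \left( -\ell_i(\langle w(\alpha^{(t)}), x_i \rangle) - \ell_i^*(-\alpha_i^{(t)}) - \alpha_i^{(t)} \langle w(\alpha^{(t)}), x_i \rangle \right)
+ \frac{\lambda}{2} \beta_b \left\| \frac{s}{\lambda n} (\chi^{(t)} - \alpha^{(t)}) \right\|_x^2 \\
&\overset{(25)}{=} -s\, G(\alpha^{(t)}) + \frac{\lambda}{2} \left( \frac{s}{\lambda n} \right)^2 \beta_b \left\| \chi^{(t)} - \alpha^{(t)} \right\|_x^2 .
\end{aligned}$$

Multiplying both sides of the resulting inequality by $-b/n$ (which reverses its direction), and noting that 
$G^{(t)} = \frac{1}{n} \|\chi^{(t)} - \alpha^{(t)}\|_x^2$, we obtain (26). To get the bound on $G^{(t)}$, recall that $\ell_i(\cdot)$ 
is $L$-Lipschitz, hence $-L \le \chi_i^{(t)} \le L$. Furthermore, $\alpha^{(t)}$ is dual feasible, hence $\ell_i^*(-\alpha_i^{(t)}) < \infty$ and so $(-\alpha_i^{(t)})$ 
is a (sub)derivative of $\ell_i$, and so we also have $-L \le \alpha_i^{(t)} \le L$ and, for each $i$, $(\chi_i^{(t)} - \alpha_i^{(t)})^2 \le 4L^2$. For the hinge 
loss we have $0 \le \chi_i^{(t)}, \alpha_i^{(t)} \le 1$, and so $(\chi_i^{(t)} - \alpha_i^{(t)})^2 \le 1$. □ 

We are now ready to prove the theorem. 

Proof of Theorem 2. We will bound the change in the dual sub-optimality $\epsilon_D^{(t)} := D(\alpha^*) - D(\alpha^{(t)})$: 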



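$$\begin{aligned}
\mathbb{E}_{A_t}\left[\epsilon_D^{(t+1)}\right]
&= \mathbb{E}\left[D(\alpha^{(t)}) - D(\alpha^{(t+1)}) + \epsilon_D^{(t)}\right]
\overset{\text{(Lemma 4)}}{\le} -b \left( \frac{s}{n} G(\alpha^{(t)}) - \left( \frac{s}{n} \right)^2 \frac{\beta_b}{2\lambda} G \right) + \epsilon_D^{(t)} \\
&\overset{\epsilon_D^{(t)} \le G(\alpha^{(t)})}{\le} -b \frac{s}{n} \epsilon_D^{(t)} + b \left( \frac{s}{n} \right)^2 \frac{\beta_b}{2\lambda} G + \epsilon_D^{(t)}
= \left( 1 - \frac{bs}{n} \right) \epsilon_D^{(t)} + b \left( \frac{s}{n} \right)^2 \frac{\beta_b G}{2\lambda}.
\end{aligned} \tag{28}$$

Unrolling this recurrence, we have: 

$$\mathbb{E}\left[\epsilon_D^{(t)}\right] \le \left( 1 - \frac{bs}{n} \right)^t \epsilon_D^{(0)} + b \left( \frac{s}{n} \right)^2 \frac{\beta_b G}{2\lambda} \sum_{i=0}^{t-1} \left( 1 - \frac{bs}{n} \right)^i
\le \left( 1 - \frac{bs}{n} \right)^t \epsilon_D^{(0)} + \frac{s}{n} \frac{\beta_b G}{2\lambda}.$$

Setting $s = 1$ and 

$$t_0 := \left[ \left\lceil \frac{n}{b} \log\left( \frac{2 \lambda n \epsilon_D^{(0)}}{G \beta_b} \right) \right\rceil \right]_+ \tag{29}$$

yields: 

$$\mathbb{E}\left[\epsilon_D^{(t_0)}\right] \le \left( 1 - \frac{b}{n} \right)^{t_0} \epsilon_D^{(0)} + \frac{\beta_b G}{2 \lambda n}
\le \frac{G \beta_b}{2 \lambda n \epsilon_D^{(0)}} \epsilon_D^{(0)} + \frac{\beta_b G}{2 \lambda n} = \frac{\beta_b G}{\lambda n}. \tag{30}$$

Following the proof of Shalev-Shwartz & Zhang we will now show by induction that 

$$\forall t \ge t_0: \quad \mathbb{E}\left[\epsilon_D^{(t)}\right] \le \frac{2 \beta_b G}{\lambda (2n + b(t - t_0))}. \tag{31}$$

Clearly (30) implies that (31) holds for $t = t_0$. Now, if it holds for some $t \ge t_0$, we show that it also holds for 
$t + 1$. Using $s = \frac{2n}{2n + b(t - t_0)}$ in (28) we have: 

$$\begin{aligned}
\mathbb{E}\left[\epsilon_D^{(t+1)}\right]
&\le \left( 1 - \frac{bs}{n} \right) \frac{2 \beta_b G}{\lambda (2n + b(t - t_0))} + b \left( \frac{s}{n} \right)^2 \frac{\beta_b G}{2\lambda} \\
&= \left( 1 - \frac{2b}{2n + b(t - t_0)} \right) \frac{2 \beta_b G}{\lambda (2n + b(t - t_0))} + b \left( \frac{2}{2n + b(t - t_0)} \right)^2 \frac{\beta_b G}{2\lambda} \\
&= \frac{2 G \beta_b}{\lambda (2n + b(t - t_0))} \left( 1 - \frac{b}{2n + b(t - t_0)} \right) \\
&= \frac{2 G \beta_b}{\lambda (2n + b(t - t_0) + b)} \cdot \frac{(2n + b(t - t_0) + b)(2n + b(t - t_0) - b)}{(2n + b(t - t_0))^2}
\le \frac{2 G \beta_b}{\lambda (2n + b(t - t_0) + b)},
\end{aligned}$$

where in the last inequality we used the arithmetic-geometric mean inequality. This establishes (31). 
Now, for the average $\bar{\alpha}$ defined in (21) we have: 

$$\mathbb{E}\left[G(\bar{\alpha})\right] = \mathbb{E}\left[ G\left( \sum_{t=T_0}^{T-1} \frac{1}{T - T_0} \alpha^{(t)} \right) \right]
\le \frac{1}{T - T_0} \mathbb{E}\left[ \sum_{t=T_0}^{T-1} G(\alpha^{(t)}) \right]. \tag{32}$$

Applying Lemma 4 with $s = \frac{n}{b(T - T_0)}$: 

$$\begin{aligned}
\mathbb{E}\left[G(\bar{\alpha})\right]
&\le \frac{n b (T - T_0)}{n b} \cdot \frac{1}{T - T_0} \sum_{t=T_0}^{T-1} \left( \mathbb{E}\left[D(\alpha^{(t+1)})\right] - \mathbb{E}\left[D(\alpha^{(t)})\right] \right) + \frac{G \beta_b n}{2 n b (T - T_0) \lambda} \\
&= \mathbb{E}\left[D(\alpha^{(T)})\right] - \mathbb{E}\left[D(\alpha^{(T_0)})\right] + \frac{G \beta_b}{2 b (T - T_0) \lambda} \\
&\le \left( D(\alpha^*) - \mathbb{E}\left[D(\alpha^{(T_0)})\right] \right) + \frac{G \beta_b}{2 b (T - T_0) \lambda}
\overset{(31)}{\le} \frac{2 \beta_b G}{\lambda (2n + b(T_0 - t_0))} + \frac{G \beta_b}{2 b (T - T_0) \lambda},
\end{aligned}$$

and if $T \ge \lceil n/b \rceil + T_0$ and $T_0 \ge t_0$: 

$$\mathbb{E}\left[G(\bar{\alpha})\right] \le \frac{\beta_b G}{b \lambda} \left( \frac{2}{\frac{2n}{b} + (T_0 - t_0)} + \frac{1}{2 (T - T_0)} \right). \tag{33}$$

Now, we can ensure the above is at most $\epsilon$ if we require: 

$$T_0 - t_0 \ge \frac{\beta_b}{b} \cdot \frac{4G}{\lambda \epsilon} - \frac{2n}{b}, \tag{34}$$

$$T - T_0 \ge \frac{\beta_b}{b} \cdot \frac{G}{\lambda \epsilon}. \tag{35}$$

Combining the requirements (29), (34) and (35) with $T \ge \lceil n/b \rceil + T_0$ and $T_0 \ge t_0$, and recalling that for the hinge 
loss $G = 1$ and with $\alpha^{(0)} = 0$ we have $\epsilon_D^{(0)} = D(\alpha^*) - D(0) \le 1 - 0 = 1$, gives the requirements in Theorem 2. □ 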
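
The two-phase argument above (a constant $s = 1$ up to $t_0$, then the decaying choice $s = 2n / (2n + b(t - t_0))$) can be checked numerically. The sketch below iterates the recursion (28) with equality, which is the worst case it allows, and compares the result with the bound (31). The function name and all parameter values are illustrative assumptions; it also assumes $b \le n$.

```python
import math


def simulate_recursion(n, b, lam, beta, G, eps0, extra_iters):
    """Iterate recursion (28) with equality and record the bound (31).

    Returns t0 from (29) and a list of (eps_t, bound_t) for t = t0, t0+1, ...
    """
    c = beta * G / (2.0 * lam)
    t0 = max(0, math.ceil((n / b) * math.log(2.0 * lam * n * eps0 / (G * beta))))
    eps = eps0
    for _ in range(t0):
        # Phase 1: s = 1, so the additive term is b * (1/n)^2 * c.
        eps = (1.0 - b / n) * eps + b * c / n ** 2
    history = []
    for k in range(extra_iters + 1):  # k = t - t0
        bound = 2.0 * beta * G / (lam * (2 * n + b * k))
        history.append((eps, bound))
        # Phase 2: decaying s as in the induction step.
        s = 2.0 * n / (2 * n + b * k)
        eps = (1.0 - b * s / n) * eps + b * (s / n) ** 2 * c
    return t0, history
```

For instance, `simulate_recursion(1000, 10, 1e-3, 1.5, 1.0, 1.0, 2000)` keeps every simulated value below the corresponding bound, in line with the $O(1/t)$ decay stated in (31).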



